ABSTRACT

Network operators today spend significant manual effort
ensuring and checking that the network meets their intended
policies. While recent work in network verification has made
giant strides in reducing this effort, it focuses on simple
reachability properties and cannot handle context-dependent
policies (e.g., how many connections a host has spawned) that
operators realize using stateful network functions (NFs).
Together, stateful NFs and context-dependent policies introduce
new expressiveness and scalability challenges that fall outside
the scope of existing network verification mechanisms. To
address these challenges, we present Armstrong, a system that
enables operators to test whether a network with stateful data
plane elements correctly implements a given context-dependent
policy. Our design makes three key contributions to address
expressiveness and scalability: (1) an abstract I/O unit for
modeling network I/O that encodes policy-relevant context
information; (2) a practical representation of complex NFs via
an ensemble of finite-state machines abstraction; and (3) a
scalable application of symbolic execution to tackle state-space
explosion. We demonstrate that Armstrong is several orders of
magnitude faster than existing mechanisms.

1 Introduction 

Network policy enforcement has been and continues to be
a challenging and error-prone task. For instance, a recent
operator survey found that 35% of networks generate more than
100 problem tickets per month and one-fourth of these take
multiple engineer-hours to resolve [18]. In this respect, recent
efforts on network testing and verification offer a promising
alternative to existing expensive and manual debugging efforts.

Despite these advances, there are fundamental gaps between
the intent of network operators and the capabilities of
these tools on two fronts: (1) data plane elements are complex
and stateful (e.g., a TCP connection state in a stateful
firewall) and (2) actual policies are context dependent; e.g.,
compositional requirements to ensure traffic is “chained”
through services [20, 22] or dynamically triggered based on
observed host behavior [21].

Figure 1: Armstrong takes in high-level policy intent
from the network operator and generates test cases to
check the implementation of the policy.

Together, stateful data planes and context-dependent policies
introduce new challenges that fall outside the scope of
existing network checking mechanisms [42, 43, 44, 63]. To
understand why, it is useful to revisit their conceptual basis.
Essentially, they capture network behavior by modeling each
network function¹ (NF) (e.g., a switch) as a “transfer” function
T(h, port) that takes in a located packet (a header h
and a port port) and outputs another located packet.² Then,
some search algorithm (e.g., model checking or geometric
analysis) is used to reason about the composition of these
T functions. Specifically, we identify three key limitations
with respect to expressiveness and scalability:

• Packets are cumbersome and insufficient: While located
packets allow us to compose models of NFs, they
are inefficient at capturing higher-layer processing semantics
(e.g., a proxy operating at the HTTP level). Further, in the
presence of dynamic middlebox actions [36], located packets lack
the necessary context information w.r.t. a packet’s processing
history and provenance, which is critical for reasoning
about policies beyond reachability.

• Transfer functions lack state and context: The transfer
abstraction misses key stateful semantics; e.g., reflexive
ACLs in a stateful firewall or a NAT using consistent
public-private IP mappings. Moreover, the output actions
of NFs have richer semantics (e.g., alerts) beyond a located
packet that determine the policy-relevant context.

• Search complexity: Exploring data plane behavior is
hard even for reachability properties [42, 44, 63]. With
stateful behaviors and richer policies, exploration is even
more intractable, and existing state-space search algorithms
(e.g., model checking) can take several tens of
hours even on small networks with fewer than 5 stateful NFs.


To address these challenges, we present a network testing
framework called Armstrong (Figure 1). We adopt active
data plane testing to complement static verification
[26, 52, 63], because it gives concrete assurances about the
behavior “on-the-wire” [63]. Armstrong takes in high-level
network policies from the operator, generates and injects test
traffic into the data plane, and then reports if the observed
behavior matches the policy intent. Note that Armstrong is not
(and does not mandate) a specific policy enforcement system
[9, 49, 53]; rather, it helps operators to check if the intended
policy is implemented correctly.

1 A network function may be stateless (i.e., switches/routers) or
stateful (i.e., middleboxes) and can be physical or virtual.

2 For concreteness, we borrow terminology from HSA [43]; other
efforts share similar ideas at their core [42, 44, 46, 52].

Armstrong’s design makes three key contributions to address
the expressiveness and scalability challenges:

• ADU I/O abstraction (§5): We propose a new Armstrong
Data Unit (ADU) as a common denominator of
traffic processing for network models. To improve scalability,
an ADU represents an aggregate sequence of packets;
e.g., an HTTP response ADU coalesces tens of raw IP
packets. Furthermore, an ADU explicitly includes the
necessary packet processing context; e.g., an ADU that
induced an alarm carries this information going forward.

• FSMs-ensemble model for NFs (§6): One might be
tempted to use a NF’s code or a finite-state machine
(FSM) as its model, since both can capture stateful
behaviors. However, these are intractable due to the
huge number of states and transitions (or code paths). To
ensure a tractable representation, we model complex NFs
as an ensemble of FSMs by decoupling logically independent
tasks (e.g., client-side vs. server-side connection in a
NF) and units of traffic (e.g., different TCP connections).

• Optimized symbolic execution workflow (§7): For
scalable test generation, we decouple it into two stages:
(1) abstract test plan generation at the ADU granularity
using symbolic execution (SE) because of its well-known
scalability properties [30, 31] and (2) a translation stage
to convert abstract plans into concrete test traffic. We
engineer domain-specific optimizations (e.g., reducing the
number and scope of symbolic variables) to improve the
scalability of SE in our domain.

We have written models for several canonical NFs in C
and implemented our domain-specific SE optimizations on top
of KLEE. We prototype Armstrong as an application over
OpenDayLight. We implement simple monitoring
and test validation mechanisms to localize the NF inducing
policy violations. Our evaluation (§10) on a real testbed
reveals that Armstrong: (1) can test hundreds of policy
scenarios on networks with hundreds of switches and stateful
NFs within two minutes; (2) dramatically improves
test scalability, providing nearly five orders of magnitude
reduction in time for test traffic generation relative to
strawman solutions (e.g., using packets as the I/O unit of NF
models, or using model checking for search); (3) is more
expressive and scalable than the state of the art; and (4)
effectively localizes intentional data/control plane bugs within
tens of seconds.

2 Motivation 

In this section, we use small but realistic network scenarios
to highlight the types of stateful NFs and context-dependent
policies used by network operators. We also highlight key
limitations of existing network test/verification efforts. To
make the discussion concrete, we use the transfer function
and located packet abstraction from HSA [43]/ATPG [63],
where each NF (e.g., a switch) is a “transfer” function
T(h, p) whose input is a located packet (a header, port
tuple) and whose output is another located packet.³ The behavior
of the network is the composition of such functions; i.e.,
$T_n(\ldots(T_2(T_1(h, p))))$. Our goal here is not to show the
limitations of these specific efforts, but to highlight why the
following scenarios fall outside the scope of this class of
verification techniques (e.g., [42, 44, 47]).

Figure 2: Is the firewall allowing solicited and blocking
unsolicited traffic from the Internet?

2.1 Stateful firewalling

While simple firewalls and OpenFlow ACLs have a simple
match-action operation, real firewalls capture TCP session
semantics. A common use is reflexive ACLs, shown in
Figure 2, where the intent is to only allow incoming packets
for established TCP connections that have been initiated
from “internal” hosts. We depict the intended policy
as a simple policy graph at the top of the figure.

Unfortunately, even this simple policy cannot be captured
by a stateless transfer function T(h, p). In particular, the T
behavior depends on the current state of the firewall for a
given connection, and the function needs to update the relevant
internal state variable. A natural extension is a finite-state
machine (FSM) abstraction where T(h, p, s) takes in a
located packet and the current state, outputs a located packet,
and updates the state. In this case, the state is per-session, but
more generally it can span multiple sessions [39].
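
To make the stateful extension T(h, p, s) concrete, the following is a minimal sketch in C of a reflexive-ACL firewall for a single session. The state names, the `located_pkt` type, and the `fw_transfer` function are illustrative assumptions rather than Armstrong's actual NF model, and the handshake handling is deliberately simplified (no retransmissions, FIN, or RST).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-session states s; names are illustrative only. */
typedef enum { FW_NULL, FW_SYN_SENT, FW_SYN_ACKED, FW_ESTABLISHED } fw_state;

/* Simplified located packet: TCP flags stand in for the header h, and
 * from_internal stands in for the ingress port p. */
typedef struct {
    bool from_internal;
    bool syn;
    bool ack;
} located_pkt;

/* T(h, p, s): returns true if the packet is forwarded, and updates *s. */
bool fw_transfer(const located_pkt *pkt, fw_state *s)
{
    assert(pkt != NULL && s != NULL);
    switch (*s) {
    case FW_NULL:
        /* Only an internal host may initiate a connection. */
        if (pkt->from_internal && pkt->syn && !pkt->ack) {
            *s = FW_SYN_SENT;
            return true;
        }
        return false;
    case FW_SYN_SENT:
        /* A SYN-ACK from outside is solicited only after an internal SYN. */
        if (!pkt->from_internal && pkt->syn && pkt->ack) {
            *s = FW_SYN_ACKED;
            return true;
        }
        return false;
    case FW_SYN_ACKED:
        /* The internal ACK completes the three-way handshake. */
        if (pkt->from_internal && !pkt->syn && pkt->ack) {
            *s = FW_ESTABLISHED;
            return true;
        }
        return false;
    case FW_ESTABLISHED:
        return true;  /* traffic of an established session is allowed */
    }
    return false;
}
```

The same external SYN is dropped in state FW_NULL but an external SYN-ACK is forwarded in state FW_SYN_SENT, which is exactly the dependence on s that a stateless T(h, p) cannot express.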

2.2 Dynamic policy violations 

Next, let us consider Figure 3, where the operator uses a
proxy for better performance and also wants to restrict web
access; e.g., $H_2$ cannot access XYZ.com. As observed
elsewhere [36], there are subtle violations that could occur
if a cached response bypasses the monitoring device.
Prior work has suggested many candidate fixes; e.g., better
NF placement, tunnels, or new extended SDN APIs [36].
Our focus here is to check whether such policy enforcement
mechanisms implement the policy correctly rather than
developing new enforcement mechanisms.

3 For brevity, we assume no multicast/broadcast effects.

Figure 4: Are we firewalling correctly based on host?

As before, we need to model the stateful behavior of the
proxy across connections, so let us consider our extended
function T(h, p, s). However, modeling the state alone is
not sufficient. Specifically, the policy violations happen for
cached responses, but this context (i.e., cached or not in this
example) depends on some internal state variable inside the
$T_{proxy}$ function. To faithfully capture the policy intent of
the operator in our network model, we need to expose the
relevant processing history of the traffic in our model. This
suggests that we need to further extend the functions to include
context as input, T(h, p, s, c), because the correct network
behavior (e.g., of downstream switches and middleboxes in our
model) depends on this context. We formalize these definitions
in §3.
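
As an illustration of why the context c must be exposed, here is a small C sketch of a proxy transfer function that records provenance and cache hit/miss alongside its state update. The types `proxy_state` and `context`, the function names, and the integer host/object identifiers are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_OBJ 8   /* illustrative cache capacity, not from the paper */

typedef struct { bool cached[MAX_OBJ]; } proxy_state;        /* state s */
typedef struct { int src_host; bool cache_hit; } context;    /* context c */

/* T_proxy: serves object obj_id for src_host, updates the cache state,
 * and returns the context (provenance plus hit/miss) of the response. */
context proxy_transfer(int src_host, int obj_id, proxy_state *s)
{
    assert(obj_id >= 0 && obj_id < MAX_OBJ);
    context c = { src_host, s->cached[obj_id] };
    s->cached[obj_id] = true;   /* a miss fetches the object and caches it */
    return c;
}

/* True if a blocked (host, object) pair was served from the cache, i.e.,
 * the case that can bypass a monitor placed behind the proxy. */
bool is_blocked_cache_hit(context c, int obj_id,
                          int blocked_host, int blocked_obj)
{
    return c.cache_hit && c.src_host == blocked_host && obj_id == blocked_obj;
}
```

Without the `cache_hit` and `src_host` fields, a downstream model element sees only the response and cannot tell whether the request from the blocked host ever passed the monitor.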

This example also highlights several other issues. First,
different NFs operate at different layers of the network
stack; e.g., the monitoring device may operate at L3/L4
while the proxy operates in terms of HTTP sessions, so the
“atomic” granularity at which their policy-relevant
states/contexts manifest differs. While it may be tempting to
choose different granularities of traffic for different NFs,
doing so means that we may no longer be able to compose our T
functions if their inputs are different. Second, the
policy-relevant context depends on a sequence of packets rather
than on an individual packet. While it is not incorrect to think
of T functions operating on packets, it is not an efficient
abstraction. Finally, note that just using headers is not
sufficient, as the behavior of the proxy depends on the actual
content.


2.3 Firewalling with cascaded NATs 

Figure 4 depicts a scenario inspired by prior work that
showed cascaded NATs are error-prone (28p0) . Note that a 
correct NAT should use a consistent public-private IP mapping
for a session [59]. To model such network behaviors,
we need to capture both the packet provenance (i.e., where
it originated from) and the consistent mapping semantics.

Unfortunately, existing tools such as HSA/ATPG essentially
model stateful NFs as “black box” functions and do
not capture or preserve the flow-consistent mapping properties.
This has two natural implications for our extended
transfer function T(h, p, s, c): (1) the context c should also
include the packet provenance, and (2) the function T must
be expressive enough to capture stateful NF semantics (e.g.,
session-consistent mappings).
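
The two requirements above can be sketched as a small NAT model in C. This is an assumption-laden illustration: the table layout, the sequential port allocation, and the names `nat_map` and `nat_provenance` are invented for the example and are not taken from any real NAT or from Armstrong's models.

```c
#include <assert.h>
#include <stddef.h>

#define NAT_SLOTS 16   /* illustrative table size */

typedef struct {
    int in_use;
    int private_ip, private_port;   /* provenance of the session */
    int public_port;                /* mapping chosen by the NAT */
} nat_entry;

typedef struct {
    nat_entry table[NAT_SLOTS];
    int next_port;                  /* next unused public port */
} nat_state;

void nat_init(nat_state *s, int first_port)
{
    assert(s != NULL);
    for (int i = 0; i < NAT_SLOTS; i++)
        s->table[i].in_use = 0;
    s->next_port = first_port;
}

/* Returns the public port for a session, reusing an existing mapping so
 * that it stays consistent; returns -1 if the table is full. */
int nat_map(nat_state *s, int private_ip, int private_port)
{
    int free_slot = -1;
    for (int i = 0; i < NAT_SLOTS; i++) {
        nat_entry *e = &s->table[i];
        if (e->in_use && e->private_ip == private_ip &&
            e->private_port == private_port)
            return e->public_port;
        if (!e->in_use && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;
    nat_entry *e = &s->table[free_slot];
    e->in_use = 1;
    e->private_ip = private_ip;
    e->private_port = private_port;
    e->public_port = s->next_port++;
    return e->public_port;
}

/* Recovers provenance: the private IP behind a public port, or -1. */
int nat_provenance(const nat_state *s, int public_port)
{
    for (int i = 0; i < NAT_SLOTS; i++)
        if (s->table[i].in_use && s->table[i].public_port == public_port)
            return s->table[i].private_ip;
    return -1;
}
```

A downstream firewall model that must decide "based on host" needs the value returned by `nat_provenance`, since the translated header alone no longer identifies the originating host.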
Figure 5: Is suspicious traffic sent to heavy IPS?


2.4 Multi-stage triggers 

So far our examples underlined the need for capturing stateful
semantics and relevant context inside a transfer function.
We end this discussion with a dynamic service chaining example
in Figure 5 that combines both effects. The intended
policy is to use the light-weight IPS (L-IPS) in the common
case (i.e., for all traffic) and only subject suspicious hosts
flagged by the L-IPS (e.g., when a host generates too many
scans) to the more expensive H-IPS (e.g., for payload signature
matching). Such multi-stage detection is useful; e.g.,
to minimize latency and/or reduce the H-IPS load. Such
scenarios are implemented today (albeit typically by hard-coding
the policy into the topology) and enabled by novel
SDN-based dynamic control mechanisms [9, 23]. Unfortunately,
we cannot check that this multi-stage operation works
correctly using existing reachability mechanisms [43, 63]
because they ignore the IPSes’ states (e.g., the current per-host
count of bad connections inside the L-IPS) and traffic context
related to the sequence of intended actions.
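
The per-host state that reachability tools ignore can be sketched in a few lines of C. The threshold value, the host indexing, and the names `lips_state` and `lips_process` are assumptions made for illustration only; the text does not specify them.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_HOSTS 4            /* illustrative number of tracked hosts */
#define BAD_CONN_THRESHOLD 3   /* illustrative; not a value from the paper */

typedef enum { SEND_TO_INTERNET, SEND_TO_HIPS } lips_action;

/* L-IPS state: per-host count of bad connections seen so far. */
typedef struct { int bad_conns[MAX_HOSTS]; } lips_state;

/* Processes one connection attempt from `host` and returns where the
 * traffic goes next; the decision depends on accumulated state. */
lips_action lips_process(lips_state *s, int host, bool conn_failed)
{
    assert(host >= 0 && host < MAX_HOSTS);
    if (conn_failed)
        s->bad_conns[host]++;
    return s->bad_conns[host] >= BAD_CONN_THRESHOLD ? SEND_TO_HIPS
                                                    : SEND_TO_INTERNET;
}
```

Note that the same connection attempt yields different actions depending on how many earlier attempts from that host failed, so a test must inject a sequence of traffic, not a single packet, to exercise the H-IPS path.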

Finally, note that the above examples have natural implications
for a search strategy to explore the data plane behavior.
Prior exhaustive search strategies were possible only
because a transfer function processes each “header”
independently and has no state. Thus they only had to search
over the “header space”. Note that this is already hard and
requires clever algorithms [63] and/or parallel solvers [64].
Designing a search strategy for the examples above is
fundamentally more challenging because we need to consider a
bigger “traffic” space (i.e., sequences of packets with
payloads) and we need to efficiently explore a state space, since
processing of a packet by an NF (e.g., a stateful firewall) can
change the behavior of the data plane for future packets.

2.5 Key observations 

We summarize key expressiveness and scalability challenges
that fall outside the scope of existing network verification
abstractions and search strategies:

• NFs are stateful (e.g., §2.1) and have complex semantics
beyond simple header match-action operations, and
abstracting them as black boxes is insufficient (e.g., §2.3);

• NF actions are triggered on sequences of packets and occur
at different logical aggregations (e.g., §2.2);

• The correct behavior depends on traffic context such as
provenance and processing history (e.g., §2.2 and §2.3);

• The space of possible outcomes in the presence of stateful
data planes operating over richer semantics (e.g., payload)
and context-dependent policies can be very large
(e.g., §2.4).

3 Problem Formulation 

In this section, we define the semantics of a stateful data
plane and use it to formalize a test trace for checking the
intended policies in a data plane. We then use these definitions
to motivate the need for a model-based testing approach.

3.1 Data Plane Semantics

In this sub-section we formalize the semantics of stateful
data planes and context-dependent policies. This formalization
serves two purposes: (1) an understanding of the data
plane semantics, where actual traffic is processed, provides
insight into the methodology of generating test traffic; in
particular, as we will see in this section, this formalization
motivates the need for modeling the data plane to bridge
the gap between a high-level policy and its manifestation in
the data plane; (2) it serves as a reference point for
future research in the area of stateful data planes and
context-dependent policies.⁴

NF: Since test traffic operates at the data plane level, in
this sub-section we define the data plane semantics. First, we
define the semantics of a NF and the network. Let $\mathcal{P}$
denote the set of packets.⁵ Formally, a NF is a 4-tuple
$(S, I, E, \delta)$ where: (i) $S$ is a finite set of states;
(ii) $I \in S$ is the initial state; (iii) $E$ is the set of
network edges; and (iv)
$\delta : S \times \mathcal{P} \rightarrow S \times \mathcal{P} \times E \times \Sigma$
is the transition relation.

Here, $\Sigma$ is a set of effects that capture the response of
a NF to a packet. Each $\alpha \in \Sigma$ provides contextual
information that the administrator cares about. Each $\alpha$ is
annotated with the specific NF generating the effect and
its relevant states; e.g., in Figure 5 we can have
$\alpha_1 = \langle LIPS : H_1, Alarm, SendToHIPS \rangle$ when the
L-IPS raises an alarm and redirects traffic from $H_1$ to the
H-IPS, and
$\alpha_2 = \langle LIPS : H_1, OK, SendToInternet \rangle$ when the
L-IPS decides that the traffic from $H_1$ was OK to send to the
Internet. Using effects, administrators can define high-level
policy intents rather than worry about low-level NF states.
Note that this NF definition is general and encompasses
stateful NFs from the previous section and stateless L2-L3
devices.

Network: Formally, a network data plane $net$ is a pair
$(N, \tau)$ where $N = \{NF_1, \ldots, NF_N\}$ is a set of NFs and
$\tau$ is the topology map. Informally, if $\tau(e) = NF_i$ then
packets sent out on edge $e$ are received by $NF_i$.⁶ We assume
that the graph has well-defined sources (with no incoming
edges) and one or more sinks (with no outgoing edges). The
data plane state of $net$ is a tuple
$\sigma = (s_1, \ldots, s_N)$, where $s_i$ is a state of $NF_i$.

4 Previous work has modeled network semantics without focusing
on stateful data planes and context-dependent policies (e.g., [49]).

5 Packets are “located” [43, 55], so that the NF can identify and use
the incoming network interface information in its processing logic.

6 We assume each edge is mapped to unique incoming/outgoing
physical network ports on two different NFs.

Processing semantics: To simplify the semantics of packet
processing, we assume packets are processed in a lock-step
(i.e., one-packet-per-NF-at-a-time) fashion and do not model
(a) batching or queuing effects inside the network (hence no
re-ordering or packet loss); (b) parallel processing effects
inside NFs; and (c) the simultaneous processing of different
packets across NFs.

Let $\sigma = (s_1, \ldots, s_i, \ldots, s_N)$ and
$\sigma' = (s_1, \ldots, s'_i, \ldots, s_N)$ be two states of $net$.
First, we define a single-hop network state transition from
$(\sigma, i, \pi)$ to $(\sigma', i', \pi')$ labeled by effect
$\alpha$, denoted
$(\sigma, i, \pi) \xrightarrow{\alpha} (\sigma', i', \pi')$, if
$\delta_i(s_i, \pi) = (s'_i, \pi', e, \alpha)$ with
$NF_{i'} = \tau(e)$. A single-hop network state transition
represents processing of one packet by $NF_i$ while the
state of all NFs other than $NF_i$ remains unchanged. For
example, when the L-IPS rejects a connection from a
user, it increments a variable tracking the number of failed
connections. Similarly, when the stateful firewall sees a new
three-way handshake completed, it updates the state for this
session to connected.

Next, we define the end-to-end state transitions that a
packet $\pi^{in}$ entering the network induces. Suppose
$\pi^{in}$ traverses a path of length $n$ through the sequence of
NFs $NF_{i_1}, \ldots, NF_{i_n}$ and ends up in $NF_{i_{n+1}}$
(note that the sequence of traversed NFs may be different for
different packets). Then the end-to-end transition is a 4-tuple
$(\sigma_1, \pi^{in}, \langle \alpha_1, \ldots, \alpha_n \rangle, \sigma_{n+1})$
such that there exists a sequence of packets
$\pi_1, \ldots, \pi_{n+1}$ with $\pi_1 = \pi^{in}$, and a
sequence of network states $\sigma_2, \ldots, \sigma_n$ such that
$\forall 1 \le k \le n$:
$(\sigma_k, i_k, \pi_k) \xrightarrow{\alpha_k} (\sigma_{k+1}, i_{k+1}, \pi_{k+1})$.

That is, the injection of packet $\pi^{in}$ into $NF_{i_1}$ when
the network is in state $\sigma_1$ causes the sequence of effects
$\langle \alpha_1, \ldots, \alpha_n \rangle$ and the network to move
to state $\sigma_{n+1}$, through the above intermediate states,
while the packet ends up in $NF_{i_{n+1}}$. For instance, when
the L-IPS is already in the toomanyconn-1 state for a particular
user and the user sends another connection attempt, then the
L-IPS will transition to the toomanyconn state and then the
packet will be redirected to the H-IPS.

Let $E2ESem(net)$ denote the end-to-end “network semantics”,
or the set of feasible transitions on the network $net$
for a single input packet.

Trace semantics: Next, we define the semantics of processing
an input packet trace $\Pi = \pi^{in}_1, \ldots, \pi^{in}_m$.
We use $\vec{\alpha}$ to denote the vector of NF effects
associated with this trace; i.e., the set of effects across all
NFs in the network. The network semantics on a trace $\Pi$ is a
sequence of effect vectors:
$TraceSem_\Pi = \langle \vec{\alpha}_1, \ldots, \vec{\alpha}_m \rangle$
where $\forall 1 \le k \le m: \pi^{in}_k \in \mathcal{P} \wedge \vec{\alpha}_k \in \Sigma^{+}$.
This is an acceptable sequence of events iff there exists a
sequence $\sigma_1, \ldots, \sigma_{m+1}$ of states of $net$ such
that $\forall 1 \le k \le m$:
$(\sigma_k, \pi^{in}_k, \vec{\alpha}_k, \sigma_{k+1}) \in E2ESem(net)$.

Policies in the data plane: Given the notion of trace semantics
defined above, we can now formally specify our goal in
developing Armstrong. At a high level, we want to test a
policy. Formally, a policy is a pair $(TraceSpec; TraceSem)$,
where $TraceSpec$ captures a class of traffic of interest, and
$TraceSem$ is the vector of effects of the form
$\langle \vec{\alpha}_1, \ldots, \vec{\alpha}_m \rangle$
that we want to observe from a correct network when it is
injected with traffic from that class. Concretely, consider two
policies:

1. In Figure 3 we want: “Cached web responses to Dept1
should go to the monitor”. Then, $TraceSpec$ captures
web traffic to/from Dept1 and
$TraceSem = \langle \alpha_1, \alpha_2 \rangle$,
with $\alpha_1$ = Proxy : Dept1, CachedObject and
$\alpha_2$ = Proxy : Dept1, SendToMon.

2. In Figure 5 we want: “If host $H_1$ contacts more
than 10 distinct destinations, then its traffic is sent
to the H-IPS”. Then, $TraceSpec$ captures traffic
from $H_1$, and $TraceSem = \langle \alpha_1, \alpha_2 \rangle$
where $\alpha_1$ = L-IPS : $H_1$, Morethan10Scan, and
$\alpha_2$ = L-IPS : $H_1$, SendtoHIPS.

Test trace generation: Our goal is to check whether such a
policy is satisfied by the actual network. More specifically,
if we have a concrete test trace $\Pi$ that satisfies
$TraceSpec_\Pi$ and should ideally induce the effects
$TraceSem_\Pi$, then the network should exhibit $TraceSem_\Pi$
when $\Pi$ is injected into it. In other words, the goal of
Armstrong in terms of test traffic generation is to find a
concrete trace that satisfies $TraceSpec_\Pi$.

3.2 Challenges of automatic test traffic generation

The vision of Armstrong involves automating the generation of
this test traffic (i.e., a set of test traces corresponding to
all policies). In attempting to do so, however, we face
two challenges. First, operators often define policies using
a high-level representation, similar to what we saw in §2, as
opposed to the complex form $(TraceSpec; TraceSem)$ that
involves low-level intricacies of each NF (i.e.,
$(S, I, E, \delta)$). Second, given a policy, we need to find
concrete test traffic, out of very many possible distinct
traces, that satisfies $TraceSpec_\Pi$.

Later sections discuss how Armstrong overcomes these
challenges: (1) §6 discusses how NF models are used to bridge
the gap between high-level policies and low-level data plane
semantics; (2) §7 then shows how to systematically search the
data plane model using symbolic execution to generate test
traffic.

4 Armstrong Overview 

In this section, we give an overview of Armstrong, describing
the key components and design ideas that address the challenges
described at the end of the previous section.

Problem scope: Armstrong’s goal is to check if an operator’s
intended policy is implemented correctly in the data
plane. (Armstrong does not mandate a specific control- or
data-plane policy enforcement mechanism [9, 26, 36, 49, 53],
and our focus in this work is not on designing such a
mechanism.) In this respect, there are two complementary classes
of approaches: (1) Static verification (e.g., HSA [43],
Veriflow [44], Vericon [26]), in which a model of the network is
given to a verification engine that checks if the configuration
meets the policy (or produces a counterexample); and (2)
Active testing (e.g., ATPG [63]), where test traffic is injected
into the network and the observed behavior is checked for
consistency with the intended policy. From a practical view,
active testing can detect implementation problems that are
outside the scope of static verification; e.g., a bug in the
firewall implementation or the middlebox orchestration logic
[39, 54]. Thus, we adopt an active testing approach in Armstrong.
That said, our modeling contributions will also improve the
scalability of static verification.

Scope of policies: For concreteness, we scope the policies
that Armstrong can (and cannot) check. In Armstrong,
a policy is defined as a set of policy scenarios. A policy
scenario is a 3-tuple (TraceSpec; policyPath; Action).
TraceSpec specifies the traffic class (e.g., in terms of the
5-tuple) to which the policy is related (e.g., srcIP∈Dept,
proto=TCP, and dstPort=80 in Figure 3), policyPath is
the intended sequence of stateful NFs that the traffic
needs to go through along with the relevant context (e.g.,
provenance=$H_2$ and proxyContext=<hit,XYZ.com>), and
Action is the intended final action (e.g., Drop) on any traffic
that matches TraceSpec and policyPath. The intended policy
of Figure 3 captures three such possibilities for
the intended behavior, namely, one ending in action Allow,
and two (i.e., for hit and miss) ending in action Drop when
$H_2$ tries to get XYZ.com, so the intended policy corresponds
to three policy scenarios.
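
One possible encoding of the three policy scenarios is sketched below in C. Everything here is an assumption for illustration: the field layout, the `ANY` wildcard, the integer encodings of hosts and sites, and the reading of the Allow scenario as "all other web traffic" are not specified in the text, and the policyPath is reduced to its context fields (the NF sequence is omitted).

```c
#include <assert.h>
#include <stddef.h>

#define ANY (-1)   /* hypothetical wildcard for a context field */

typedef enum { ACT_ALLOW, ACT_DROP, ACT_NONE } action;
enum { HOST_H1 = 1, HOST_H2 = 2 };          /* illustrative host ids */
enum { SITE_OTHER = 0, SITE_XYZ = 1 };      /* illustrative site ids */

typedef struct {
    int dst_port;      /* TraceSpec (only one field shown) */
    int provenance;    /* policyPath context: originating host */
    int cache_hit;     /* policyPath context: 1 = hit, 0 = miss */
    int site;          /* policyPath context: requested site */
    action act;        /* Action */
} policy_scenario;

static int matches(int want, int got) { return want == ANY || want == got; }

/* Three scenarios for the Figure 3 policy, most specific first. */
static const policy_scenario FIG3_POLICY[] = {
    { 80, HOST_H2, 1,   SITE_XYZ, ACT_DROP  },
    { 80, HOST_H2, 0,   SITE_XYZ, ACT_DROP  },
    { 80, ANY,     ANY, ANY,      ACT_ALLOW },
};

/* Returns the intended action of the first matching scenario, or
 * ACT_NONE if the traffic is outside every TraceSpec. */
action intended_action(int dst_port, int provenance, int cache_hit, int site)
{
    size_t n = sizeof FIG3_POLICY / sizeof FIG3_POLICY[0];
    for (size_t i = 0; i < n; i++) {
        const policy_scenario *p = &FIG3_POLICY[i];
        if (p->dst_port == dst_port && matches(p->provenance, provenance) &&
            matches(p->cache_hit, cache_hit) && matches(p->site, site))
            return p->act;
    }
    return ACT_NONE;
}
```

The two Drop rows differ only in the proxy context, which is why a test generator needs one trace that induces a cache hit and another that induces a miss.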

Other properties like checking performance, crash-freedom,
infinite loops inside NFs, and race conditions are
outside the scope of Armstrong. Similarly, if there are
context/state behaviors outside the Armstrong models, then
Armstrong will not detect those violations.

Design space and strawman solutions: Given the complexity
of stateful NFs and context-dependent policies, it
will be tedious for an operator to manually reason about their
interactions and generate concrete test cases to check the
data plane behavior. In a nutshell, the goal of Armstrong is
to simplify the operator’s workflow so that they only need to
specify high-level policy scenarios such as the policies from
the previous section. Armstrong automatically generates test
traffic to exercise each given policy scenario to simplify the
process of validating if the data plane correctly implements
the operator’s intention.

In a broader context, Armstrong is an instance of a
specification-based or model-based testing paradigm [60].
Any model-based testing solution needs to bridge the semantic
gap between the high-level intended behavior of the
system (in the case of Armstrong, high-level policies) and the
actual system behavior (in the case of Armstrong, running code
and hardware in the data plane). A specific solution can be
viewed in terms of a design space involving three key
components: (1) a basic unit of input-output (I/O) behavior; (2)
a model of the expected behavior of each component; and
(3) some way to search the space of end-to-end system
behaviors to generate test cases. We can therefore represent a
point in the design space as a 3-tuple with specific designs
for each component.

To see why it is challenging to find a solution that
is both expressive and scalable, let us consider two
points from this design space. At one end of the
spectrum, we have prior work like ATPG [63] with
(I/O = LocPkt, Model = Stateless, Search = Geometric).
As argued earlier, these are not expressive. At the opposite
end, we consider running model checking on implementation
source code and using packets as the I/O unit; i.e.,
(I/O = Pkt, Model = Impl, Search = MC). While this
can be expressive (modulo the hidden contexts), it is not
scalable: actual NF code can be tens of thousands
of lines of code, whereas model checking tools struggle
beyond a few hundred lines. Furthermore, using a
NF’s implementation code as its model is problematic, as
implementation bugs can defeat the purpose of testing by
affecting the correctness of test cases.⁷

High-level approach: Our contribution lies in design
choices for each of these three dimensions that combine to
achieve scalability and expressiveness:

• I/O = ADU (§5): We introduce a novel abstract network
data unit called an Armstrong Data Unit (ADU)
that improves scalability of test traffic generation via
traffic aggregation and addresses expressiveness by explicitly
capturing relevant traffic context;

• Model = FSMEnsemble (§6): We model each NF as an
ensemble of FSMs and compose them to model the data
plane. Here, using FSMs as building blocks enables the
stateful model, and breaking a monolithic FSM into the
ensemble dramatically shrinks the state space;

• Search = Optimized Symbolic Execution (§7): Given that
our goal is to generate test traffic, we can sacrifice
exhaustive searching and use more scalable approaches like
symbolic execution (SE) rather than model checking.
However, using SE naively does not handle large topologies,
and thus we implement domain-specific optimizations
for pruning the search space.

Note that these decisions have natural synergies; e.g.,
ADUs simplify the effort to write NF models and also
improve the scalability of our SE step.
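
To illustrate why splitting a monolithic FSM into an ensemble shrinks the state space, the following C sketch keeps one small FSM per connection and steps only the FSM that an event belongs to. The connection states, events, and function names are hypothetical; this is a sketch of the general idea under those assumptions, not Armstrong's actual NF model.

```c
#include <assert.h>

#define N_CONN 4   /* illustrative number of tracked connections */

typedef enum { CONN_NULL, CONN_SYN_SEEN, CONN_ESTABLISHED,
               N_CONN_STATES } conn_state;
typedef enum { EV_SYN, EV_ACK, EV_RST } conn_event;

/* The ensemble: one independent FSM per unit of traffic. */
typedef struct { conn_state conn[N_CONN]; } nf_ensemble;

/* Transition function of a single per-connection FSM. */
conn_state conn_step(conn_state s, conn_event ev)
{
    if (ev == EV_RST)
        return CONN_NULL;
    if (s == CONN_NULL && ev == EV_SYN)
        return CONN_SYN_SEEN;
    if (s == CONN_SYN_SEEN && ev == EV_ACK)
        return CONN_ESTABLISHED;
    return s;   /* all other events leave the state unchanged */
}

/* An event only touches the FSM of its own connection. */
void ensemble_step(nf_ensemble *nf, int conn_id, conn_event ev)
{
    assert(conn_id >= 0 && conn_id < N_CONN);
    nf->conn[conn_id] = conn_step(nf->conn[conn_id], ev);
}

/* Number of states of the equivalent monolithic FSM, whose state is the
 * tuple of all per-connection states: N_CONN_STATES ^ n_conn. */
long monolithic_states(int n_conn)
{
    long n = 1;
    for (int i = 0; i < n_conn; i++)
        n *= N_CONN_STATES;
    return n;
}
```

With these illustrative sizes, each ensemble member has 3 states, whereas the monolithic FSM over 4 connections has 3^4 = 81 states, and the gap grows exponentially with the number of connections.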

End-to-end workflow: Putting these ideas together, Figure 6
shows Armstrong’s end-to-end workflow. The operator
defines the intended network policies in a high-level
form, such as the policy graphs shown on top of each figure
of §2. Armstrong uses a library of NF models, where
each model works at the ADU granularity. Given the library
of NF models and the network topology specification (with
various switches and middleboxes), Armstrong constructs a
concrete network model for the given network. Then, it uses
the network model in conjunction with the policies to
automatically generate concrete test traffic. Here, we decouple
the test traffic generation into two logical steps by first
running SE on the data plane model to generate abstract (i.e.,
ADU-level) test traffic and then using a suite of traffic
injection libraries to translate this abstract traffic into
concrete test traffic via test scripts. Finally, we use a
monitoring mechanism that records data plane events and analyzes
them to declare a test verdict to the operator (i.e., success,
or a policy violation along with the NF responsible for the
violation).

7 There is also the pragmatic issue that we may not have the actual
code for proprietary NFs.

Note that operators do not need to be involved in the task
of writing NF models or in populating the test generation
library. These are one-time offline tasks and can be
augmented with community efforts.

5 ADU Input-Output Abstraction 

In this section, we present our ADU abstraction for modeling
NF I/O operations and show how it enables scalability
and expressiveness, while still acting as a common denominator
across diverse NFs. We discuss the implications of
this choice for the design of the NF models and our search
strategy. We end the section with guidelines and a recipe for
extending ADUs for future scenarios.

Key ideas: Concretely, an ADU is simply a struct, as shown in
Listing 1.

Listing 1: ADU structure.

struct ADU{
  // IP fields
  int srcIP, dstIP, proto;
  // transport
  int srcPort, dstPort;
  // TCP specific
  int tcpSYN, tcpACK, tcpFIN, tcpRST;
  // HTTP specific
  int httpGetObj, httpRespObj;
  // Armstrong-specific
  int dropped, networkPort, ADUid;
  // Each NF updates traffic context
  int c-Tag[MAXTAG];
};

Our ADU abstraction extends located packets from prior work in
two key ways:

• Traffic aggregation: First, each ADU can represent a sequence
of packets rather than an individual packet. This enables us to
represent higher-layer operations more efficiently; e.g., state
inside an NF (e.g., a TCP connection’s current state on a
firewall) is associated with a set of packets rather than a
single IP packet. As another example, a proxy’s cache state
transitions to a new “relevant state” (i.e., cached state with
respect to an object) only after the entire payload has been
reassembled.

• Explicitly binding the context: Each ADU is explicitly bound
to its relevant context through the c-Tags field. Conceptually,
c-Tags ensure that the ADU carries its “policy-related
processing history” as it goes through the network. The natural
question is: what should these c-Tags capture? Building on our
insights from the motivating examples of §2, c-Tags contain two
types of information: (1) the ADU’s provenance (i.e., its origin,
which may otherwise be hidden, for example, after a NAT), and
(2) the NF processing context for the intended policies (e.g.,
1 bit for cache hit/miss, 1 bit for alarm/no-alarm). Concretely,
a c-Tag is the union of different fields that embed the relevant
context w.r.t. the different NFs that the ADU has gone through
and the ADU provenance.
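To make the two kinds of context concrete, the following is a minimal C sketch, not Armstrong's actual code. The tag slots (TAG_PROVENANCE, TAG_CACHE), the helper names (host_emit, nat_model, proxy_tag), and the trimmed-down ADU are all assumptions made for illustration; the field is spelled cTags because a hyphen is not valid in a C identifier.

```c
#include <assert.h>

#define MAXTAG 4

/* Hypothetical tag slots; the paper does not fix a layout. */
enum { TAG_PROVENANCE = 0, TAG_CACHE = 1 };
enum { CACHE_MISS = 0, CACHE_HIT = 1 };

/* Trimmed-down ADU: only the fields this sketch touches. */
typedef struct {
    int srcIP, dstIP;
    int cTags[MAXTAG];   /* "c-Tags" in the text */
} ADU;

/* A host stamps its own address as provenance when it emits an ADU. */
static ADU host_emit(int srcIP, int dstIP) {
    ADU a = { srcIP, dstIP, {0} };
    a.cTags[TAG_PROVENANCE] = srcIP;
    return a;
}

/* NAT model: rewrites srcIP but leaves the provenance tag intact,
 * so downstream NFs can still see the true origin. */
static ADU nat_model(ADU in, int publicIP) {
    ADU out = in;
    out.srcIP = publicIP;
    return out;
}

/* Proxy model fragment: records the cache hit/miss context. */
static ADU proxy_tag(ADU in, int hit) {
    ADU out = in;
    out.cTags[TAG_CACHE] = hit ? CACHE_HIT : CACHE_MISS;
    return out;
}
```

After an ADU from host 10 passes through nat_model and proxy_tag, its srcIP is the NAT's public address while cTags[TAG_PROVENANCE] still holds 10, which is the property a downstream policy check would rely on.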

Implications for NF models and test generation: The ADU
abstraction has natural synergies and implications for both NF
models and test traffic generation. First, ADUs help reduce the
complexity of NF models by consolidating protocol semantics
(e.g., HTTP, TCP) and effects involving multiple IP packets. For
example, all packets corresponding to an HTTP reply are
represented by one ADU with the httpRespObj field indicating the
retrieved object id. Note in particular that the struct fields
are a superset of the required fields of specific NFs; each NF
processes only the fields relevant to its function (e.g., the
switch function ignores HTTP layer fields of input ADUs; see
§6). Second, w.r.t. our test traffic generation, by aggregating
multiple packets, ADUs reduce the search space for model
exploration tools such as SE (§7). That said, they introduce a
level of indirection because the output of SE cannot be directly
used as a test trace, and thus we need the extra translation
step before we can generate raw packet streams.

Designing future ADUs: Given the continued evolution of NFs and
policies, a natural question is how we can extend the basic ADU.
While we cannot claim to have an ADU definition that can
encompass all possible network scenarios and policy
requirements, we present a high-level design roadmap that has
served us well. First, the key to determining the fields of an
ADU is to identify policy-related network protocols in all NFs
of interest. For example, each of TCP SYN, TCP SYN+ACK, etc.
triggers important state transitions in a stateful firewall and
thus should be captured as ADU fields. The key point here is
that our ADU abstraction is future-proof; e.g., if we decide to
add an ICMP field to the ADU of Listing 1 (e.g., because our new
policy involves ICMP on some new NF models), this is not going
to affect existing NF models, as they simply ignore this new
field. The second point is to consider a conservatively large
c-Tag field to accommodate various types of relevant traffic
context (e.g., a sufficient number of bits to allow
representation of different types of IPS alarms, as opposed to
having 1 bit for capturing alarm/no-alarm in c-Tag).

6 Modeling the Data Plane 

In this section, we begin by exploring some seemingly natural
strawman approaches to model each NF (§6.1) and then present our
idea of modeling NFs as ensembles of FSMs by decoupling an NF’s
actions based on logically independent units of traffic and
internal tasks (§6.2).

6.1 Strawman solutions 

To serve as a usable basis for automatic test traffic
generation, an NF model needs to be scalable, expressive, and
amenable to composition for network-wide modeling. Given the
composability requirement, we first rule out very “high-level”
models such as writing a proxy in terms of HTTP object
requests/responses (35). This leaves us two options: using the
code or the FSM abstraction we alluded to in §2.

1. Code as “model”: This choice seems to remove the burden of
explicit modeling, but such a model is too complex. For
instance, Squid (16) has > 200K lines of code and introduces
other sources of complexity that are irrelevant to the policies
being checked. Another fundamental issue with this choice is
that a bug in the to-be-tested implementation code affects the
correctness of the test traffic generated from such a model. In
summary, this approach yields expressive “models”, but is not
scalable for exploring the search space.

2. Write an NF model as a monolithic FSM: §2 already suggests
that FSMs may be a natural extension to the stateless transfer
functions. Thus, we can consider each NF as an FSM operating at
the ADU granularity. That is, we can think of the current state
of a stateful NF as a vector of state variables (e.g., in a
proxy this vector may have three elements: per-host connection
state, per-server connection state, and per-object cache state).
Again, this is not scalable; e.g., for a stateful NF with S
types of state, each with V possible values, this “giant” FSM
has $V^S$ states.

Based on this discussion, we adopt FSMs as a natural 
starting point to avoid the logical problems associated with 
using code. Next we discuss how we address the scalability 
challenge. 

6.2 Tractable models via FSM ensembles 

Our insight is to borrow from the design of actual NFs. In 
practice, NF programs (e.g., a firewall) do not explicitly enu¬ 
merate the full-blown FSM. Rather, they independently track 
the states for “active” connections. Furthermore, different 
functional components of an NF are naturally segmented; 
e.g., client- vs. server-side handling in a proxy is separate. 
This enables us to simplify a monolithic NF FSM into a more 
tractable ensemble of FSMs along two dimensions: 
Figure 7: Illustrating how decoupling independent traffic units
reduces the number of states. (a) A monolithic FSM model of a
stateful firewall w.r.t. two TCP connections. (b) A
per-connection FSM model of a stateful firewall.


Listing 2: Proxy as an ensemble of FSMs.

 1 ADU Proxy(NFId id, ADU inADU){
 2
 3   if ((frmClnt(inADU)) && (isHttpRq(inADU))){
 4     if (!cached(id, inADU)){
 5       if (srvConnEstablished(id, inADU))
 6         outADU=rqstFrmSrv(id, inADU);
 7       else
 8         outADU=tcpSYNtoSrv(id, inADU);
 9     }
10   }
11   /*set c-Tags based on context (e.g., hit/miss) */
12   outADU.c-Tags = ...
13
14   return outADU;
15 }


Figure 8: Illustrating how decoupling independent NF tasks
reduces the number of states. (a) A monolithic FSM model of a
proxy w.r.t. a client, a server, and an object. (b) A proxy
model as an FSM ensemble, enabled by decoupling independent NF
tasks.

• Decoupling independent traffic units: Consider a stateful
firewall. The naive approach of modeling it as a monolithic FSM
is shown in Figure 7a. While this is an expressive model, it is
not scalable as the number of connections grows. We decouple
this into independent per-connection FSMs as shown in Figure 7b,
yielding the firewall model as an ensemble of FSMs (footnote 8).

• Decoupling independent tasks: To see this idea, consider a
proxy, which is instructive because it operates on a session
layer, terminates sessions, and can respond directly with
objects in its cache. The code of a proxy, e.g., Squid,
effectively has three modules: TCP connections with the client,
TCP connections with the server, and the cache. The proxy FSM is
effectively the “product” of these modules (Figure 8a). However,
we can decouple the different tasks; i.e., client-side TCP
connections, server-side TCP connections, and the cache. Instead
of a “giant” FSM model with each state being of the
“cross-product” form (client_TCP_state, server_TCP_state,
cache_content), we use an ensemble of three small FSMs, each
with a single type of state, i.e., (client_TCP_state),
(server_TCP_state), and (cache_content) (Figure 8b; footnote 9).


8 In general, if the number of connections is $|conn|$ (2 in
this example) and the number of states per connection is
$|state|$ (4 in this example), it is easy to see that this
insight cuts the number of states from $|state|^{|conn|}$ to
$|conn| \times |state|$.

9 Concretely, if an NF has $|T|$ independent tasks (e.g., 3 for
proxy) where the $i$th task has $S_i$ states (e.g., 2 for the
cache task in this example), the ensemble cuts down the number
of states from $\prod_{i=1}^{|T|} S_i$ to
$\sum_{i=1}^{|T|} S_i$.

Note that these ideas are complementary and can be com¬ 
bined to reduce the number of states. For instance, if our 
proxy is serving two clients talking to two separate servers, 
we can first decouple states at the task-level and further de¬ 
couple the states within each task at the connection level. 
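The arithmetic behind both decoupling steps can be checked with a few lines of C. This is only an illustrative sketch: the function names are ours, and the product-versus-sum reading for independent tasks is our interpretation of the ensemble idea rather than a formula quoted from the text.

```c
#include <assert.h>

/* Integer power base^exp for small non-negative exponents. */
static unsigned long ipow(unsigned long base, unsigned exp) {
    unsigned long r = 1;
    while (exp-- > 0) r *= base;
    return r;
}

/* Monolithic FSM tracking `conn` connections jointly:
 * every combination of per-connection states is one state. */
static unsigned long monolithic_conn_states(unsigned long state, unsigned conn) {
    return ipow(state, conn);
}

/* Ensemble: one small FSM per connection. */
static unsigned long ensemble_conn_states(unsigned long state, unsigned conn) {
    return (unsigned long)conn * state;
}

/* Monolithic FSM over independent tasks: product of task state counts. */
static unsigned long monolithic_task_states(const unsigned long *s, int n) {
    unsigned long r = 1;
    for (int i = 0; i < n; i++) r *= s[i];
    return r;
}

/* Ensemble over independent tasks: sum of task state counts. */
static unsigned long ensemble_task_states(const unsigned long *s, int n) {
    unsigned long r = 0;
    for (int i = 0; i < n; i++) r += s[i];
    return r;
}
```

With the firewall numbers from the text (4 states per connection, 2 connections), the monolithic model has 16 states and the ensemble has 8. For a proxy with task sizes {4, 4, 2} (the two 4s are assumed values; only the 2 for the cache task comes from the text), the counts are 32 versus 10.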

To see this concretely, Listing 2 shows a partial code snippet
of a proxy model, focusing on the actions when a client is
requesting a non-cached HTTP object and the proxy does not
currently have a TCP connection established with the server.
Here the id allows us to identify the specific proxy instance.
The specific state variables of different proxy instances are
inherently partitioned per NF instance (not shown). These track
the relevant NF states, and are updated by the NF-specific
functions such as srvConnEstablished (footnote 10). If the input
inADU is a client HTTP request (Line 3), and if the requested
object is not cached (Line 4), the proxy checks the status of
the TCP connection with the server. If there is an existing TCP
connection with the server (Line 5), the output ADU will be an
HTTP request (Line 6). Otherwise, the proxy will initiate a TCP
connection with the server (Line 8).

Context processing: The one remaining issue is the propagation
and updating of the context information in our model network. As
we saw in §5, each NF encodes the relevant context in the c-Tag
field of the outgoing ADU (Line 12). For instance, if an NF
modifies headers, then the ADU encodes the provenance of the
ADU, which can be used to check if the relevant policy at some
downstream NF is implemented correctly. In summary, each NF is
thus modeled as an FSM ensemble that receives an input ADU and
generates an output ADU with the corresponding updated c-Tags.
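As a rough, compilable counterpart to the proxy snippet, the sketch below models two of the proxy's tasks (server-side connection and cache) as separate small pieces of state and updates a cache-context tag on each output. Everything here is an assumption made for illustration: the Kind enum, the ProxyState layout, the proxy_step name, and the omission of the client-side TCP task. Object ids are assumed to lie in [0, NUM_OBJECTS).

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_OBJECTS 8

enum { TAG_CACHE = 0, MAXTAG = 2 };
enum { CACHE_MISS = 0, CACHE_HIT = 1 };
typedef enum { HTTP_REQ, HTTP_RESP, TCP_SYN, TCP_SYNACK } Kind;

typedef struct {
    Kind kind;
    int  fromClient;     /* 1 if the ADU arrives from the client side */
    int  obj;            /* object id, 0 <= obj < NUM_OBJECTS */
    int  cTags[MAXTAG];  /* "c-Tags" in the text */
} ADU;

/* Ensemble state: each task keeps its own small state, instead of
 * one cross-product state for the whole proxy. */
typedef struct {
    bool srvConnEstablished;    /* server-side TCP task */
    bool cached[NUM_OBJECTS];   /* cache task, one bit per object */
} ProxyState;

static ADU proxy_step(ProxyState *p, ADU in) {
    ADU out = in;
    if (in.fromClient && in.kind == HTTP_REQ) {
        if (p->cached[in.obj]) {
            out.kind = HTTP_RESP;                 /* serve from cache */
            out.cTags[TAG_CACHE] = CACHE_HIT;
        } else {
            out.cTags[TAG_CACHE] = CACHE_MISS;
            /* Forward the request, or open the server connection first. */
            out.kind = p->srvConnEstablished ? HTTP_REQ : TCP_SYN;
        }
        out.fromClient = 0;                       /* proxy is now the sender */
    } else if (!in.fromClient && in.kind == TCP_SYNACK) {
        p->srvConnEstablished = true;             /* server-side FSM advances */
    } else if (!in.fromClient && in.kind == HTTP_RESP) {
        p->cached[in.obj] = true;                 /* cache FSM advances */
        out.cTags[TAG_CACHE] = CACHE_MISS;
    }
    return out;
}
```

Feeding the same client request three times, with a server SYN-ACK and then a server response in between, yields TCP_SYN, then HTTP_REQ tagged as a miss, then HTTP_RESP tagged as a hit. Each branch touches only one task's state, which is what keeps the ensemble small.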

6.3 Network-wide modeling 

Given the per-NF models, next we discuss how we compose these
models to generate network-wide models. To make this discussion
concrete, we use the network from Figure 3 and see how we
compose the proxy, switch, and monitor models in Listing 3.

Each NF instance is identified by a unique id that allows 
us to know the “type” of the NF and thus index into the 

10 This choice of passing “id”s and modeling the state in per-id 
global variables is an implementation artifact of using C/KLEE, and 
is not fundamental to our design. 
relevant variables. Lines 7-11 model the stateless switch.
Function lookUp takes the input ADU, looks up its forwarding
table, and creates a new outADU with its port value set based on
the forwarding table. Given the operator’s policy, parameters of
the network and each NF model are configured. For example, given
the policy of Figure 3, hostToWatch is set to H2. As another
example, given the policy of Figure 5, the alarm threshold in
the L-IPS is configured to 10. Following prior work (43), we
consider each switch NF as a static data store lookup updating
located packets. Lines 13-24 capture the monitoring NF. Given
the actual network’s topology and the library of NF models, this
composition is completely automatic and does not require the
operator to “write” any code. Given the network topology,
Armstrong can identify the instance of Next_DPF (line 31 of
Listing 3).

Similar to prior work (43, 63), we consider a network model in
which packets are processed in a one-packet-per-NF-at-a-time
fashion. That is, we do not model (a) batching or queuing
effects inside the network, (b) parallel processing effects
inside NFs, or (c) simultaneous processing of different packets
across NFs. Since our goal is to look for “policy” violations
represented in terms of NF-context sequences, this assumption is
reasonable. Based on these semantics, we model the data plane as
a simple loop (Line 31). Note that because ADUs extend the
located packet abstraction, they also capture locations (e.g.,
via networkPort in Listing 1). This allows NFs to be easily
composed, similar to the composition of the simple T functions
(43).

In each iteration, an ADU is processed (Line 32) in two steps:
(1) the ADU is forwarded to the other end of the current link
(Line 33), and (2) the ADU is passed as an argument to the NF
connected to this end (e.g., a switch or firewall) (Line 34).
The ADU output by the NF is processed in the next iteration
until the ADU is “DONE”; i.e., it either reaches its destination
or gets dropped by an NF (footnote 11). The role of assert will
become clear in the next section when we use symbolic execution
to exercise a specific policy behavior.
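The forward-then-process loop can be sketched in plain C as follows. This is an illustration under stated assumptions, not Armstrong's network model: the three-node chain (switch, monitor, destination), the port and NF ids, the link_peer table, and the mon_drop_http knob are all hypothetical.

```c
#include <assert.h>

/* Hypothetical link ids and NF ids for a chain: src -> switch -> monitor -> dst. */
enum { P_SRC_TO_SW, P_SW_TO_MON, P_MON_TO_DST, NUM_PORTS };
enum { NF_SWITCH, NF_MON, NF_DST };

/* Topology: which NF sits at the far end of each link. */
static const int link_peer[NUM_PORTS] = { NF_SWITCH, NF_MON, NF_DST };

typedef struct {
    int networkPort;   /* link the ADU is currently on */
    int dropped;
    int isHttp;
    int delivered;
} ADU;

static int mon_drop_http = 0;   /* configuration knob of the monitor model */

static ADU switch_model(ADU in) {
    ADU out = in;
    out.networkPort = P_SW_TO_MON;      /* static forwarding lookup */
    return out;
}

static ADU mon_model(ADU in) {
    ADU out = in;
    if (in.isHttp && mon_drop_http) out.dropped = 1;
    else out.networkPort = P_MON_TO_DST;
    return out;
}

static ADU dst_model(ADU in) {
    ADU out = in;
    out.delivered = 1;
    return out;
}

typedef ADU (*NFModel)(ADU);
static const NFModel nf_table[] = { switch_model, mon_model, dst_model };

static int done(ADU a) { return a.dropped || a.delivered; }

/* One ADU at a time, one NF at a time; returns the number of NF hops. */
static int run_data_plane(ADU *a) {
    int hops = 0;
    while (!done(*a)) {
        int nf = link_peer[a->networkPort];   /* step 1: forward on the link */
        *a = nf_table[nf](*a);                /* step 2: hand the ADU to that NF */
        hops++;
    }
    return hops;
}
```

With mon_drop_http left at 0, an HTTP ADU injected on P_SRC_TO_SW is delivered after three hops; setting the knob to 1 makes the monitor drop it after two.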


6.4 Writing future NF models 

We have manually written a broad range of NF models. While we do
not have an algorithm for writing an NF model, we can provide
design guidelines for writing future NF models based on our own
methodology. We begin by enumerating the set of policy scenarios
(e.g., as in the examples of §2); this enumeration step can be a
broader community effort in the future (11, 21, 29). Across the
union of these scenarios, we identify the necessary contexts
(e.g., alarm, cache hit) and the corresponding NF states that
affect these contexts (e.g., TCP state machine of a firewall,
cache contents). This gives us a set of model requirements. Then
for each type

11 Since NFs can be time-triggered (e.g., TCP connection time-out), 
we capture time using an ADU field. These “time ADUs” are in¬ 
jected by the network model periodically and upon receiving a time 
ADU, the relevant NFs update their time-related state. 


Listing 3: Network pseudocode for Figure 3.

 1 // Symbolic ADUs to be instantiated (see §7).
 2 ADU A[20];
 3 int objIdToWatch = XYZ.com;
 4 int hostToWatch = H2;
 5 // Global state variables
 6 bool Cache[2][100]; // 2 proxies, 100 objects
 7 // Model of a switch
 8 ADU Switch(NFId id, ADU inADU){
 9   outADU=lookUp(id, inADU);
10   return outADU;
11 }
12
13 // Model of a monitoring NF
14 ADU Mon(NFId id, ADU inADU){
15   outADU = inADU;
16
17   if (isHttp(id, inADU)){
18     takeMonAction(id, inADU);/* if inADU
19       contains objIdToWatch destined to
20       hostToWatch, set outADU.dropped to 1. */
21   }
22
23   return outADU;
24 }
25 // Model of a proxy NF; see Listing 2
26 ADU Proxy(NFId id, ADU inADU){
27
28 }
29 main(){
30   // Model of the data plane
31   for each injected A[i]{
32     while (!DONE(A[i])){
33       Forward A[i] on current link;
34       A[i] = Next_DPF(A[i]);
35       assert(
36         (!(A[i].c-Tags[provenance]==hostId[H2]))
37         ||(!(A[i].c-Tags[cacheContext]==objIdToWatch))
38         ||(!(A[i].port==MonitorPort)));
39     }
40   }
41 }


of NF, we start with a “dumb” switch abstraction and incre¬ 
mentally add the logic to the model to capture the expected 
behaviors of the NF w.r.t. these required states and contexts; 
e.g., for a NAT we add per-flow consistent mapping behav¬ 
iors and packet provenance context. In doing so, we make 
sure to identify the opportunities for decoupling independent 
tasks and traffic units to enable the scalable ensemble repre¬ 
sentation. While we are not aware of automated tools for 
synthesizing middlebox models, recent advances in program 
analysis and software engineering might be a promising av¬ 
enue for automating model synthesis (e.g., (27)). 

7 Test Traffic Generation 

In this section, we describe how we use the network-wide model
and the operator’s policy to generate concrete test traffic to
exercise policy-relevant data plane states. For Armstrong to be
interactive for operators, we want this step to be scalable
enough to produce test plans within seconds to a few minutes
even for large networks. Unfortunately, several canonical search
solutions, including model checking (3, 33) and AI graph
planning tools, do not scale beyond networks with 5-10 stateful
NFs; e.g., model checking took 25 hours for a network with 6
switches and 3 middleboxes. Next, we describe how we make this
test generation problem tractable.
7.1 Symbolic execution for abstract test plans

Why Symbolic Execution (SE)? There are two key scalability
concerns about test traffic generation. First, we need to search
over a very large space of possible sequences of traffic units.
While ADUs improve scalability as compared with IP packets (§5)
via aggregation, we still have to search over the space of
possible ADU value assignments. Second, the state space of the
data plane is again very large. While the FSM ensembles
abstraction significantly reduces the number of states (§6), it
does not address state space explosion due to composition of
NFs; e.g., if the models of $NF_1$ and $NF_2$ can reach $K_1$
and $K_2$ possible states for some ADU, respectively, the
composition can reach $K_1 \times K_2$ states. The traffic- and
state-space explosion makes our problem (even just finding
abstract test traffic) challenging.

To address this scalability challenge, we turn to symbolic
execution (SE), which is a well-known approach in formal
verification to address state-space explosion (30). At a high
level, an SE engine explores possible behaviors of a given
program by considering different values of symbolic variables
(31). One concern is that SE sacrifices coverage. In our
specific application context, this tradeoff to enable
interactive testing is worthwhile. First, administrators may
already have specific testing goals in mind. Second,
configuration problems affecting many users will naturally
manifest even with one test trace. Finally, with a fast
solution, we can run multiple tests to improve coverage.

Mapping policy to assertions: For each policy scenario
(TraceSpec; policyPath; Action) of the operator’s policy,
Armstrong uses SE as follows. First, we constrain the symbolic
ADUs to satisfy the TraceSpec condition. Second, we introduce
the negation of policyPath, namely ¬(policyPath), as an
assertion in the network model code. In practice, given the
policy and network topology, Armstrong instruments the network
model with ¬(policyPath) assertions expressed in terms of ADU
fields (e.g., networkPort, c-Tags). Then, the SE engine finds an
assignment to symbolic ADUs such that the assertion is violated
(footnote 12). Because we use the negation in the assertion, in
effect, SE concretizes a sequence of ADUs that induce policyPath
in the network model. This abstract test traffic generated by
SE, after being translated into concrete test traffic and
injected into the actual data plane, must traverse the NFs
specified in policyPath and result in Action; otherwise, the
policy scenario is incorrectly implemented.
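The negation trick can be seen in a few lines of C. The sketch below is illustrative only: the tag slots, the constants HOST_H2, OBJ_TO_WATCH, and MONITOR_PORT, and the two function names are assumptions. It shows that the condition handed to assert is the De Morgan form of ¬(policyPath), so the only inputs that violate it are exactly those that realize policyPath.

```c
#include <assert.h>
#include <stdbool.h>

enum { TAG_PROVENANCE, TAG_CACHE, MAXTAG };

typedef struct {
    int networkPort;
    int cTags[MAXTAG];   /* "c-Tags" in the text */
} ADU;

/* Hypothetical policy parameters. */
enum { HOST_H2 = 2, OBJ_TO_WATCH = 42, MONITOR_PORT = 7 };

/* policyPath: a cached response for the watched object, originating
 * from H2, arrives at the monitor's port. */
static bool policy_path(const ADU *a) {
    return a->cTags[TAG_PROVENANCE] == HOST_H2
        && a->cTags[TAG_CACHE] == OBJ_TO_WATCH
        && a->networkPort == MONITOR_PORT;
}

/* The condition that would be placed inside assert(...):
 * the negation of policy_path, written as a disjunction. */
static bool assertion_condition(const ADU *a) {
    return !(a->cTags[TAG_PROVENANCE] == HOST_H2)
        || !(a->cTags[TAG_CACHE] == OBJ_TO_WATCH)
        || !(a->networkPort == MONITOR_PORT);
}
```

An SE engine asked to violate assertion_condition must therefore produce an ADU for which policy_path holds, which is the abstract test input we are after.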

Examples: To make this concrete, let us revisit Figure 3 and
Listing 3, where we want a test plan to observe cached responses
from the proxy to Dept. Lines 35-38 show the assertion to get a
trace (i.e., a sequence of ADUs) that changes the state of the
data plane such that the last ADU in the abstract traffic trace:
(1) is from host H2 (Line 36), (2) corresponds

12 Note that an assertion of the form
$\neg(A_1 \cap \cdots \cap A_n)$, or equivalently
$(\neg A_1 \cup \cdots \cup \neg A_n)$, is violated only if each
term $A_i$ evaluates to true.


Listing 4: Assertion pseudocode for Figure 5 to trigger alarms
at both IPSes.

1 // Global state variables
2 int L_IPS_Alarm[noOfHosts]; //alarm per host
3 int H_IPS_Alarm[noOfHosts]; //alarm per host
4
5 assert((!L_IPS_Alarm[A[i].srcIP]) ||
6        (!H_IPS_Alarm[A[i].srcIP]));


to a cached response (Line 37), and (3) reaches the network port
to which the monitor is attached (Line 38). For example, the SE
engine might give us a test plan with 5 ADUs: three ADUs between
a host in the Dept. and the proxy to establish a TCP connection
(the 3-way handshake), a fourth ADU with httpGetObj = httpObjId
from the host to the proxy (a cache miss), followed by another
ADU with the field httpGetObj set to httpObjId to induce a
cached response. Similarly, Listing 4 shows an assertion in
Lines 5-6 so that an alarm is triggered at both the L-IPS and
the H-IPS of the example from Figure 5.

7.2 Optimizing SE 

While SE is orders of magnitude faster than other candidates as
the search mechanism, it is still not sufficient for interactive
testing; even after a broad sweep of configuration parameters
and command line arguments (e.g., max-sym-array-size,
max-memory, and optimize) to customize KLEE, it took several
hours even for a small topology (§10). To scale to larger
topologies, we implement two key optimizations:

• Minimizing number of symbolic variables: Making an entire ADU
structure (Listing 1) symbolic will force KLEE to find values
for every field. To avoid this, Armstrong uses the policy
scenario to determine a small subset of ADU fields as symbolic;
e.g., when it is testing a data plane with a stateful firewall
but without a proxy, it makes the HTTP-relevant fields concrete
(i.e., non-symbolic) by assigning the don’t care value *
(represented by -1 in our implementation) to them. As another
example, Armstrong sets a client’s TCP port number to a
temporary value (as opposed to making the srcPort field
symbolic). This value is only used in the model for test
planning, and the actual client TCP port is chosen by the host
at run time (§7.3).

• Scoping values of symbolic variables: The TraceSpec 
already scopes the range of values each ADU can take. 
Armstrong further narrows this range by using the policy 
scenario to constrain possible values of symbolic ADU 
fields. For example, while tcpSYN is an integer ADU 
field, Armstrong restricts its value to be either 0 or 1 to 
shrink the search space. 
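Both optimizations map naturally onto KLEE's intrinsics klee_make_symbolic and klee_assume. The sketch below is our illustration of how that could look, not Armstrong's code: the function make_firewall_adu, the DONT_CARE macro, and the USE_KLEE switch are assumptions. Without USE_KLEE, the two intrinsics are replaced by do-nothing stand-ins so that the file compiles natively; in that mode the "symbolic" fields simply stay at 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef USE_KLEE
#include <klee/klee.h>
#else
/* Native stand-ins so the sketch compiles without KLEE. */
static void klee_make_symbolic(void *addr, size_t nbytes, const char *name) {
    (void)addr; (void)nbytes; (void)name;
}
static void klee_assume(uintptr_t condition) { (void)condition; }
#endif

#define DONT_CARE (-1)   /* concrete "don't care" value, as in the text */

typedef struct {
    int srcIP, dstIP, proto;
    int srcPort, dstPort;
    int tcpSYN, tcpACK, tcpFIN, tcpRST;
    int httpGetObj, httpRespObj;
} ADU;

/* Prepare one ADU for a firewall-only scenario (no proxy on the path). */
static void make_firewall_adu(ADU *a, int tmpClientPort) {
    ADU zero = {0};
    *a = zero;

    /* Optimization 1: keep irrelevant fields concrete. */
    a->httpGetObj  = DONT_CARE;
    a->httpRespObj = DONT_CARE;
    a->srcPort     = tmpClientPort;   /* placeholder; real port chosen at run time */

    /* Only the policy-relevant TCP flags become symbolic. */
    klee_make_symbolic(&a->tcpSYN, sizeof a->tcpSYN, "tcpSYN");
    klee_make_symbolic(&a->tcpACK, sizeof a->tcpACK, "tcpACK");

    /* Optimization 2: scope each symbolic int to {0, 1}. */
    klee_assume((a->tcpSYN == 0) | (a->tcpSYN == 1));
    klee_assume((a->tcpACK == 0) | (a->tcpACK == 1));
}
```

The bitwise | inside klee_assume is deliberate: it keeps the constraint a single expression rather than a short-circuit branch.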

7.3 Generating concrete test traffic 

The output of SE is a sequence of ADUs, $ADUSeq_{SE}$, and our
next goal is to translate it into concrete test packets. Since
ADUs are abstract I/O units, we cannot directly inject them into
the data plane. Moreover, we cannot simply do a one-to-one
translation between ADUs and raw packets and do a trace replay
(2); e.g., we need some session semantics for TCP, and in an
actual HTTP session several parameters will be outside of our
control (e.g., chosen by the remote server at test run time).
While we do not claim to have a comprehensive algorithm for
translating an arbitrary $ADUSeq_{SE}$ into concrete test
traffic, we use a heuristic approach as follows.

We have created a library using domain knowledge to map a known
ADUSeq into a test script. For instance, if we have an ADUSeq
consisting of three ADUs for TCP connection establishment and a
web request, we map this into a simple wget request with the
required parameters (e.g., server IP and object URL) for the
request indicated by the ADUSeq. In the most basic case, the
script will be a simple IP packet. In our current
implementation, we have manually populated this library and
currently use 11 such traffic generation primitive functions
(e.g., closeTCP(.), getHTTP(.), sendIPPacket(.)) that support
IP, TCP, UDP, HTTP, and FTP. Automating the task of populating
such a trace library is outside the scope of the paper.

Now, given an $ADUSeq_{SE}$, we use this library as follows. We
partition the $ADUSeq_{SE}$ based on srcIP-dstIP pairs (i.e.,
communication end-points) of ADUs; i.e.,
$ADUSeq_{SE} = \bigcup_i ADUSeq_i$. Then for each partition
$ADUSeq_i$, we do a longest-specific match (i.e., match on a
protocol at the highest possible layer of the network stack) in
our test script library, retrieve the corresponding scripts for
each subsequence, and then concatenate these scripts. We
acknowledge this step is heuristic, and creating a comprehensive
mapping process is outside the scope of this paper.
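One way to read the partition-then-match step is as a greedy longest-pattern match per end-point pair. The C sketch below follows that reading and is only an approximation of the heuristic: the Kind enum, the two library entries (openTCP is a hypothetical name; getHTTP and sendIPPacket echo primitives named in the text), and the fallback to sendIPPacket for unmatched ADUs are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { K_SYN, K_SYNACK, K_ACK, K_HTTP_GET, K_OTHER } Kind;
typedef struct { int srcIP, dstIP; Kind kind; } ADU;

#define MAX_PAT  4
#define MAX_PART 32

typedef struct { const char *script; size_t len; Kind pat[MAX_PAT]; } Entry;

/* Hypothetical script library: longer patterns sit at higher layers. */
static const Entry library[] = {
    { "getHTTP", 4, { K_SYN, K_SYNACK, K_ACK, K_HTTP_GET } },
    { "openTCP", 3, { K_SYN, K_SYNACK, K_ACK } },
};
#define LIB_SIZE (sizeof library / sizeof library[0])

/* End-points are an unordered pair: both directions belong together. */
static int in_pair(const ADU *a, int x, int y) {
    return (a->srcIP == x && a->dstIP == y) ||
           (a->srcIP == y && a->dstIP == x);
}

static int matches_at(const Entry *e, const Kind *k, size_t avail) {
    if (e->len > avail) return 0;
    for (size_t j = 0; j < e->len; j++)
        if (e->pat[j] != k[j]) return 0;
    return 1;
}

/* Writes script names for the (x, y) partition into out; returns the count. */
static size_t plan_scripts(const ADU *seq, size_t n, int x, int y,
                           const char **out, size_t max_out) {
    Kind part[MAX_PART];
    size_t m = 0, count = 0, i = 0;
    for (size_t s = 0; s < n && m < MAX_PART; s++)       /* partition */
        if (in_pair(&seq[s], x, y)) part[m++] = seq[s].kind;
    while (i < m && count < max_out) {                   /* greedy match */
        const Entry *best = NULL;
        for (size_t e = 0; e < LIB_SIZE; e++)
            if (matches_at(&library[e], part + i, m - i) &&
                (best == NULL || library[e].len > best->len))
                best = &library[e];
        out[count++] = best ? best->script : "sendIPPacket";
        i += best ? best->len : 1;
    }
    return count;
}
```

For a handshake followed by an HTTP GET between hosts 1 and 2, the planner picks the single getHTTP script rather than openTCP plus a leftover packet, which is the intent of matching at the highest possible layer.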

8 Test Monitoring and Validation 

After the test traffic is injected into the data plane, the out¬ 
come should be monitored and validated. First, we need to 
disambiguate true policy violations from those caused by 
background interference. Second, we need mechanisms to 
help localize the misbehaving NFs. While a full solution to 
fault diagnosis and localization is outside the scope of this 
paper, we discuss the practical heuristics we implement. 

Monitoring: Intuitively, if we can monitor the status of the
network in conjunction with the test injection, we can check if
any of the background or non-test traffic can potentially induce
false policy violations. Rather than monitor all traffic (we
refer to this as MonitorAll), we can use the intended policy to
capture a smaller relevant traffic trace; e.g., if the policy
involves only traffic to/from the proxy, then we can focus on
the traffic on the proxy’s port. To further minimize this
monitoring overhead, as an initial step we capture relevant
traffic only at the switch ports that are connected to the
stateful NFs rather than collect traffic traces from all network
ports. However, if this provides limited visibility and we need
a follow-up trial (see below), then we revert to logging traffic
at all ports for the follow-up exercise.

Validation and localization: Next, we describe our current
workflow to validate whether the test meets our policy intent
and, if the test fails, to help us localize the sources of
failure. The workflow naturally depends on whether the test was
a success or a failure and whether we observed interfering
traffic, as shown in Table 1.

                            | Orig = Obs | Orig ≠ Obs
  --------------------------+------------+---------------------------
  No interference or        | Success    | Fail. Repeat on Orig − Obs
  resolvable interference   |            | using MonitorAll
  --------------------------+------------+---------------------------
  Unresolvable interference | Unknown; repeat Orig using MonitorAll

Table 1: Validation and test refinement workflow.

Given the specific policy we are testing and the relevant 
traffic logs, we determine if the network satisfies the in¬ 
tended behavior; e.g., do packets follow the policy-mandated 
paths? In the easiest case, if the observed path Obs matches 
our intended behavior Orig and we have no interfering traf¬ 
fic, this step is trivial and we declare a success. Similarly, if 
the two paths match, even if we have potentially interfering 
traffic, but our monitoring reveals that it does not directly 
impact the test (e.g., it was targeting other applications or 
servers), we declare a success. 

Clearly, the more interesting case is when we have a test
failure; i.e., Obs ≠ Orig. If we identify that there was no
truly interfering traffic, then there was some potential source
of policy violation. We then identify the largest common path
prefix between Obs and Orig, i.e., the point until which the
observed and intended behaviors match, and to localize the
source of failure, we zoom in on the “logical diff” between the
paths. However, we might have some logical gaps because of our
choice to only monitor the stateful NF-connected ports; e.g., if
the proxy response is not observed by the monitoring device,
this can be because of a problem on any link or switch between
the proxy and the monitoring device. Thus, when we run these
follow-up tests, we enable MonitorAll to obtain full visibility.

Finally, for the cases where there was indeed some truly 
interfering traffic, then we cannot have any confidence if the 
test failed/succeeded even if Obs = Orig. Thus, in this case 
the only course of action is a fall back procedure to repeat 
the test but with MonitorAll enabled. In this case, we use an 
exponential backoff to wait for the interfering flows to die. 
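The decision logic of Table 1, together with the common-prefix step used for localization, fits in a short C sketch. The encoding below is an assumption made for illustration: paths are arrays of NF ids, and the enum and function names (Interference, Verdict, common_prefix, classify) are ours.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    INTERFERENCE_NONE_OR_RESOLVABLE,
    INTERFERENCE_UNRESOLVABLE
} Interference;

typedef enum {
    VERDICT_SUCCESS,
    VERDICT_FAIL_REPEAT_DIFF,    /* repeat on the diff with MonitorAll */
    VERDICT_UNKNOWN_REPEAT_ALL   /* repeat the whole test with MonitorAll */
} Verdict;

/* Length of the longest common prefix of the observed and intended paths;
 * the first mismatch after it is where localization should zoom in. */
static size_t common_prefix(const int *obs, size_t nObs,
                            const int *orig, size_t nOrig) {
    size_t i = 0;
    while (i < nObs && i < nOrig && obs[i] == orig[i]) i++;
    return i;
}

static Verdict classify(const int *obs, size_t nObs,
                        const int *orig, size_t nOrig, Interference itf) {
    /* With unresolvable interference, neither outcome can be trusted. */
    if (itf == INTERFERENCE_UNRESOLVABLE) return VERDICT_UNKNOWN_REPEAT_ALL;
    size_t p = common_prefix(obs, nObs, orig, nOrig);
    if (p == nObs && p == nOrig) return VERDICT_SUCCESS;
    return VERDICT_FAIL_REPEAT_DIFF;
}
```

For an intended path {1, 2, 3, 4} and an observed path {1, 2, 5}, common_prefix returns 2, so the follow-up test would focus on what happens after the second NF.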

9 Implementation 

NF models: We wrote C models for switches, ACL devices, stateful
firewalls (capable of monitoring TCP connections and blocking
based on L3/4 semantics), NATs, L4 load balancers, HTTP/FTP
proxies, passive monitoring, and simple intrusion prevention
systems (counting failed connection attempts and matching
payload signatures). Our models range from 10 (for a switch) to
100 lines (for a proxy cache) of C code. The main loop of the
network model, utility functions, and header files (e.g., ADU
definitions) have a total of fewer than 200 LoC. To put these
numbers in context, real-world middleboxes can range from 2K
lines (e.g., Balance (1)) to a few 100K lines (e.g., Squid (16),
Snort (15)). We reuse common templates across NFs; e.g., the TCP
connection sequence is used in both the firewall model and the
proxy model.

Validating NF models: First, we use a bounded model checker,
CBMC, on individual NF models and the network model to ensure
they do not contain software bugs (e.g., pointer violations).
This was a time-consuming but
one-time task. Second, we used call graphs visualiza¬ 
tion (8|[llf| based on extensive, manually generated input 
traffic traces to check that the model behaves as expected. 

Test traffic generation and injection: We use KLEE with 
the optimizations discussed earlier to produce the ADU- 
level test traffic, and then translate it to test scripts that are 
deployed at the injection points. Test traffic packets are 
marked by setting a specific (otherwise unused) bit. 

Traffic monitoring and validation: We currently use offline
monitoring via tcpdump (with suitable filters); we plan to
integrate more real-time solutions like NetSight (40). We use
OpenFlow (48) to poll/configure switch state.

10 Evaluation 


In this section, we show that:

(1) Armstrong enables close-to-interactive running times even
for large topologies (§10.1);

(2) Armstrong’s design is critical for scalability (§10.2); and

(3) Armstrong successfully helps diagnose a broad spectrum of
data plane policy violations (§10.3).


Testbed and topologies: To run realistic experiments
with large topologies, we use a testbed of 13 server-grade
machines (20-core 2.8GHz servers with 128GB RAM)
connected via a combination of direct 1GbE links and a
10GbE Pica8 OpenFlow-enabled switch. On each server,
with KVM installed, we run injectors and software NFs
as separate VMs, connected via Open vSwitch software
switches. The specific stateful NFs (i.e., middleboxes) are
iptables [7] as a NAT and a stateful firewall, Squid as
a proxy, Snort as an IPS/IDS, Balance [1] as the load
balancer, and PRADS as a passive monitor.

In addition to the example scenarios from §2, we use 8
randomly selected recent topologies from the Internet Topology
Zoo with 6-196 nodes. We also construct two larger
topologies (400 and 600 nodes) by extending these topologies.
These serve as switch-level topologies; we extend them
with different NFs to enforce policies. As a concrete policy
enforcement scheme, we implemented a tag-based solution
to handle dynamic middleboxes [36]. We reiterate that
the design/implementation of this scheme is not the goal of
Armstrong; we simply needed some concrete solution.


10.1 Scalability of Armstrong 


We envision operators using Armstrong in an interactive
fashion; i.e., the time for test generation should be within
1-2 minutes even for large networks with hundreds of switches
and middleboxes.


Figure 9: Test generation latency vs. topology size.

Impact of topology size: We fix the policy size (i.e., the
length of the chain of stateful NFs in the policy) to 3: a NAT,
followed by a proxy, followed by a stateful firewall.
The firewall is expected to block access from a fixed subset
of origin hosts to certain web content. To each switch-level
topology, we add a number of middleboxes (0.5 × #switches)
and connect each middlebox to a randomly selected switch,
with at most one middlebox connected to each switch. There
is also one host connected to each switch that will be used
as the end point of policies. The smallest topology with 6
switches (Heanet) has one instance of the policy chain (i.e.,
a NAT, a proxy, and a firewall); we linearly increase the number
of policy chains to test as a function of topology size.
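
The topology augmentation described above can be sketched as follows. This is only an illustration under assumptions: the function name `augment_topology`, the device naming scheme, and the use of a seeded random generator are hypothetical and not taken from the Armstrong implementation.

```python
import random


def augment_topology(switches, seed=0):
    """Attach one host per switch and a middlebox to half of the switches.

    Returns (hosts, middleboxes); each is a dict mapping a device name
    to the switch it is attached to.
    """
    rng = random.Random(seed)
    # One host per switch, used as a policy end point.
    hosts = {f"h{i}": sw for i, sw in enumerate(switches)}
    # 0.5 x #switches middleboxes; sampling without replacement
    # guarantees at most one middlebox per switch.
    num_mboxes = len(switches) // 2
    chosen = rng.sample(list(switches), num_mboxes)
    middleboxes = {f"mb{i}": sw for i, sw in enumerate(chosen)}
    return hosts, middleboxes
```

With 6 switches this yields 3 middleboxes, i.e., exactly one NAT-proxy-firewall chain, matching the smallest setup described in the text.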

Figure 9 shows the average test traffic generation latency.
(Values are close to the average, so we do not show error bars.)
In the largest topology with 600 switches and 300 middleboxes
(i.e., 100 policy chain instances), the traffic generation
latency of Armstrong is 113 seconds. To put this in context,
we also show the traffic generation time of a strawman solution
that runs the CBMC model checker on our data plane
model. Even on a tiny 9-node topology with 6 switches and
3 middleboxes this took 25 hours; i.e., Armstrong on a 90x
larger topology is at least five orders of magnitude faster than
the status quo. Note that this result considers Armstrong running
sequentially; we can trivially parallelize Armstrong across the
different policy scenarios.

Impact of policy complexity: Next, we consider the effect
of policy complexity, measured by the number of middleboxes
present in the policy. We fix the topology to have 92
switches (OTE GLOBE). To stress test Armstrong, we generate
synthetic longer chains in which the intended action of
each NF on the chain depends on some contextual information
from the previous NF. Figure 10 shows that even in the case
of the longest policy chain with 15 middleboxes, Armstrong
takes only 84 seconds. Again, to put the number in context,
we also show the strawman.

Figure 10: Test latency vs. policy chain length.

Breakdown of test traffic generation latency: Recall
from §7 that test generation in Armstrong has two stages.
We find that translating abstract test traffic into concrete test
traffic accounts for 4-6% of the entire latency to generate
the test traffic; this is the case across different topology
sizes and policy sizes (i.e., policy chain lengths) (not shown).

End-to-end overhead: After Armstrong generates test traffic,
it injects the test traffic, monitors it, and determines the
result. The actual test on the wire lasts < 3 seconds in our
experiments with 600 switches and 300 middleboxes. However,
we monitor the network for a longer 10-second window
to capture possibly relevant traffic events. On our largest
topology with 600 nodes, this validation analysis took only
87 seconds (not shown).

10.2 Armstrong design choices 

Next, we do a component-wise analysis to demonstrate the
effect of our key design choices and optimizations.

Code vs. models: Running KLEE on the smallest NF codebase
of around 2000 LoC (i.e., Balance [1]) took about 20
hours. In a very small experiment with a policy chain of length
2 involving only one switch directly connected to a client, a
server, a load balancer, and a monitor, traffic generation
took 57 hours (not shown).

ADU vs. packet: First, to see how aggregating a sequence
of packets as an ADU helps with scalability, we vary the file size
in an HTTP request and response scenario. Then, we use
Armstrong to generate test traffic to test the proxy-monitor
policy (Figure 3) in terms of ADUs vs. raw MTU-sized
packets. Figure 11 shows how, on the topology with 600
switches and 300 middleboxes, test traffic generation latency
increases with the size of the response. Because the number
of test packets is dominated by the number of object retrieval
packets, aggregating all file retrieval packets as one ADU
significantly cuts the latency of the test traffic generation.
(The results, not shown, are consistent across topologies as
well as when using FTP instead of HTTP.)
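
To give a feel for why the ADU abstraction helps, the sketch below counts how many I/O units represent one response at each granularity. The 1460-byte payload per packet (a 1500-byte MTU minus 40 bytes of TCP/IP headers) and the function names are assumptions for illustration; the text does not give these details.

```python
# Assumed payload per MTU-sized packet: 1500-byte MTU minus 40 bytes
# of TCP/IP headers. This value is an illustrative assumption.
MSS_BYTES = 1460


def packets_for_response(size_bytes: int, mss: int = MSS_BYTES) -> int:
    """Number of MTU-sized packets needed to carry the response body."""
    return (size_bytes + mss - 1) // mss  # integer ceiling division


def io_units(size_bytes: int, use_adu: bool) -> int:
    """I/O units needed: one ADU per response, or one unit per packet."""
    return 1 if use_adu else packets_for_response(size_bytes)


if __name__ == "__main__":
    for label, size in [("10KB", 10 * 1024), ("100KB", 100 * 1024),
                        ("1MB", 1024 ** 2), ("10MB", 10 * 1024 ** 2)]:
        print(label, io_units(size, use_adu=False), io_units(size, use_adu=True))
```

The packet-level count grows linearly with the file size, while the ADU-level count stays at one unit per response.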



Figure 11: Effect of using ADU vs. packet for various
request sizes (x-axis: file size, 10KB to 10MB).

SE vs. model checking: Our results already showed dramatic
gains of Armstrong w.r.t. model checking on the raw
code. One natural question is whether model checking could have
benefited from the other Armstrong optimizations. To this
end, we evaluated the performance of an optimized CBMC-based
model checking solution with Armstrong-specific optimizations
such as FSM ensembles, ADUs, and other scoping
and variable reduction optimizations (§7). This optimized
version was indeed significantly faster than before, but
it was still two orders of magnitude slower than Armstrong
(not shown). This suggests that while our abstractions are
independently useful for other network verification efforts
using model checking, these mechanisms are not directly
suitable for the interactive testing time-scales we envision
in Armstrong.

Figure 12: Improvements due to SE optimizations
(x-axis: topology size, # of switches).

Impact of SE optimizations: We examine the effect of
the SE-specific optimizations (§7) in Figure 12. To put our
numbers in context, using KLEE without any optimizations
on a network of six switches and one policy chain with three
middleboxes took > 19 hours. We see that minimizing the
number of symbolic variables reduces the test generation latency
by three orders of magnitude, and scoping the values
yields a further > 9× reduction.

10.3 End-to-end use cases 

Next, we demonstrate the effectiveness of Armstrong in
finding policy violations.

Diagnosing induced enforcement bugs: We used a “red
team-blue team” exercise to evaluate the end-to-end utility of
Armstrong in debugging policy violations. Here, the red team
(Student 1) informs the blue team (Student 2) of the policies
for each network, and then secretly picks one of the intended
behaviors (at random) and creates a failure mode that causes
the network to violate this policy; e.g., misconfiguring the
L-IPS count threshold or disabling some control module. The
blue team uses Armstrong to (a) identify that a violation
occurred and (b) localize the source of the policy violation. We
also repeated these experiments with the student roles reversed,
but do not show these results for brevity.

Table 2 highlights the results for a subset of these scenarios
and also shows the specific traces that Armstrong generated.
Three of the scenarios use the motivating examples
from §2. In the last scenario (Conn. limit.), two hosts are
connected to a server through an authentication server to prevent
brute-force password guessing attacks. The authentication
server is expected to halt a host’s access after 3 consecutive
failed login attempts. In all scenarios the blue team
successfully localized the failure (i.e., which NF, switch, or link
is the root cause) within 10 seconds. Note that these bugs
could not be exposed with existing debugging tools such as
ATPG [63], ping, or traceroute.¹³

13 They can detect link/switch failures but cannot capture
subtle bugs w.r.t. stateful/context-dependent behaviors.
“Red Team” scenario | Armstrong test trace
Proxy/Mon (Fig. 3); S1-S2 link is down | Non-cached request from inside the Dept, followed by a request for the same object by another source host in the Dept
Proxy/Mon (Fig. 3); the port of S1 connected to the proxy is misconfigured to not support OpenFlow | HTTP request from the Dept
Cascaded NATs (Fig. 4); FlowTags [36] controller shut down | H1 attempts to access the server
Multi-stage triggers (Fig. 5); L-IPS miscounts by summing three hosts | H1 makes 9 scan attempts followed by 9 scans by H2
Conn. limit.; login counter resets | H1 makes 3 continuous login attempts with a wrong password
Conn. limit.; S1 missing switch forwarding rules from AuthServer to the protected server | H2 makes a login attempt with the correct password

Table 2: Some example red-blue team scenarios.
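
The intended behavior in the Conn. limit. scenario can be captured by a small state machine. The sketch below is an illustrative toy model, not the NF model used in Armstrong; the class name `AuthServerModel` and the return labels are hypothetical.

```python
class AuthServerModel:
    """Toy FSM: block a host after MAX_FAILURES consecutive failed logins."""

    MAX_FAILURES = 3  # threshold stated in the scenario description

    def __init__(self):
        self.failures = {}    # host -> consecutive failed attempts
        self.blocked = set()  # hosts whose access has been halted

    def login(self, host, password_ok):
        """Process one login attempt; return 'BLOCKED', 'DENIED', or 'ALLOWED'."""
        if host in self.blocked:
            return "BLOCKED"
        if password_ok:
            self.failures[host] = 0  # a success resets the consecutive count
            return "ALLOWED"
        self.failures[host] = self.failures.get(host, 0) + 1
        if self.failures[host] >= self.MAX_FAILURES:
            self.blocked.add(host)
        return "DENIED"
```

Under this model, the test trace "H1 makes 3 continuous login attempts with a wrong password" should leave H1 blocked; a deployment whose login counter resets incorrectly would still admit H1 afterwards, which is the kind of discrepancy the trace is meant to expose.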

Loops and reachability: Armstrong can also help diagnose
reachability problems. It is worth noting
that while checking such properties in stateless data planes
is easy [43], this does not extend to stateful data planes.
We extended Armstrong to support reachability properties via
a new use of assertions. For instance, to detect loops we add
assertions of the form assert(seen[ADU.id][port] < K), where
ADU is a symbolic ADU, port is a switch port, and K reflects
a simplified definition of a loop: the same ADU
is observed at the same port > K times. Similarly, to check
if some traffic can reach PortB from PortA in the network,
we initialize an ADU with the port field set to PortA
and use an assertion of the form assert(ADU.port != PortB).
Using this technique, we were able to detect synthetically
induced switch forwarding loops in stateful data
planes (not shown).
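
The two assertion patterns above can be illustrated on a concrete forwarding table. This is a simplified sketch under assumptions: Armstrong checks the assertions with symbolic execution over symbolic ADUs and stateful NF models, whereas the code below walks a single concrete ADU through a static port-to-port map. The names `forward`, `reaches`, and `next_port`, and the threshold K = 2, are hypothetical.

```python
from collections import defaultdict

K = 2  # assumed loop threshold for this example


def forward(adu_id, start_port, next_port, max_hops=100):
    """Follow next_port from start_port; raise AssertionError on a loop."""
    seen = defaultdict(lambda: defaultdict(int))
    port = start_port
    path = []
    for _ in range(max_hops):
        if port is None:
            break
        # Loop check, mirroring assert(seen[ADU.id][port] < K).
        assert seen[adu_id][port] < K, f"loop detected at {port}"
        seen[adu_id][port] += 1
        path.append(port)
        port = next_port.get(port)  # None when the ADU leaves the network
    return path


def reaches(adu_id, port_a, port_b, next_port):
    """True if an ADU injected at port_a is ever observed at port_b."""
    return port_b in forward(adu_id, port_a, next_port)
```

For example, with `next_port = {"p1": "p2", "p2": "p3", "p3": "p1"}` the loop assertion fires once the ADU returns to a port it has already visited K times.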

11 Related Work 


Network verification: There is a rich literature on static
reachability checking [43, 47, 61]. At a high
level, these focus on simple properties (e.g., black holes,
loops) and do not tackle networks with complex middleboxes.
NICE combines model checking and symbolic execution
to find bugs in control plane software [32]. Armstrong
is complementary in that it generates test cases for
data plane behaviors. Similarly, SOFT generates tests to
check switch implementations against a specification [45].
Again, this cannot be extended to middleboxes.


Test packet generation: The work closest in spirit to Armstrong
is ATPG [63], which builds on HSA to generate test
packets to test reachability. As we discussed earlier, it cannot
be applied to our scenarios. First, middlebox behaviors
are not “stateless transfer functions”, a property that is critical
for the scalability of ATPG. Second, the behaviors we want to test
require us to look beyond single-packet test cases.


Programming languages: Other work attempts to generate
“correct-by-construction” programs [25, 26, 38]. Currently,
their semantics do not capture stateful data planes
and context-dependent behaviors. That said, our work in
Armstrong is complementary to such enforcement mechanisms;
e.g., active testing may be our only option to check if
the network with proprietary NFs behaves as intended.


Network debugging: There is a rich literature on fault
localization in networks and systems (e.g., [37, 51, 56, 57]).
These algorithms can be used in the inference engine of
Armstrong. Since this is not the primary focus of our work,
we use simpler heuristics.


Modeling middleboxes: Joseph and Stoica formalized
middlebox forwarding behaviors but do not model stateful
behaviors. The only works that model stateful behaviors
are FlowTest [35], Symnet [59], and work by Panda
et al. [52]. FlowTest’s high-level models are not composable
and the AI planning approaches do not scale beyond 4-5
node networks. Symnet [59] uses models written in Haskell
to capture NAT semantics similar to our example; based on
published work we do not have details on their models,
verification procedures, or scalability. The work of Panda et al.
is different from Armstrong both in terms of goals
(reachability and isolation) and techniques (model checking).

Simulation and shadow configurations: Simulation,
emulation [5, 10], and shadow configurations [24] are common
methods to model/test networks. Armstrong is orthogonal
in that it focuses on generating test scenarios. While
our current focus is on active testing, Armstrong applies to
these platforms as well. We also posit that our techniques
can be used to validate these efforts.


12 Conclusions 

Armstrong tackles a key missing piece of existing network
verification efforts: context-dependent policies and stateful
data planes introduce fundamental expressiveness and scalability
challenges for existing abstractions and exploration
mechanisms. We make three key contributions to address
these challenges: (1) a novel ADU abstraction for modeling
network I/O behavior; (2) tractable modeling of NFs as FSM
ensembles; and (3) an optimized test workflow using symbolic
execution. We demonstrate that Armstrong can handle
complex policies over large networks with hundreds of
middleboxes within 1-2 minutes. In doing so, we take the “CAD
for networks” vision one step closer to reality.